Introduction

Diabetes, according to the World Health Organization, is a chronic disease that occurs when blood glucose levels are abnormally high, posing a risk to several organs such as the heart, eyes, kidneys, and nerves (World Health Organization 2020). There is currently no cure for diabetes, but an individual's likelihood of developing the disease can be predicted from medical indicators such as blood pressure, glucose levels, insulin levels, and genetics.

Recognizing the importance of early diabetes diagnosis, this project utilised the Pima Indians Diabetes dataset from the machine learning repository at the University of California, Irvine to develop a supervised classification machine learning model. Through the application of data science, this initiative aims to assist medical practitioners in increasing the life expectancy of women in the community.


Table of Contents

1.0 Data pre-processing

1.1 Variables Description
1.2 Checking number of rows and columns
1.3 Checking data types and distributions
1.4 Handling missing values
1.5 Detect Outliers
1.6 Conclusion

2.0 Formulate machine learning tasks

2.1 Exploratory Data Analysis (EDA)
    2.1.1 Explore distribution of Non-Diabetic (0) and Diabetes (1) data in each independent parameter
    2.1.2 Relationship between features
    2.1.3 Formulate hypothesis
2.2 Identify learning problems for machine learning

3.0 Specify three learning algorithms

3.1 K-Nearest Neighbour (KNN)
3.2 Decision Tree
3.3 Random Forest

4.0 Data partitioning

4.1 Ratio selection
4.2 Data partitioning process

5.0 Model development

5.1 Expanding hyperparameters
5.2 K-fold Cross Validation
    5.2.1 Splitting the training data set in section 4.2 into train and validation sets
    5.2.2 Select number of KFold
    5.2.3 Validation performance for each learning algorithm
    5.2.4 Select best model performance

6.0 Performance assessment

6.1 Hyperparameter tuning
6.2 Evaluate model accuracy score and confusion matrix

1.0 Data pre-processing

1.1 Variables description

Based on the information provided by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK, https://www.niddk.nih.gov/), which collected and contributed this data set, together with the analysis from Assignment 2, the following table describes each variable in the data set.

image-2.png (variable description table)

1.2 Checking number of rows and columns

1.3 Checking data types and distributions

Findings:

1. Number of rows:

2. Number of columns:

3. Data types:

In conclusion, this data set is sufficient for developing supervised machine learning classification models.

1.4 Handling missing values

Findings:

  1. None of the 9 columns has null values.
  2. 7 attributes have a minimum value of 0.

These 0 values may have been encoded from raw Null entries during the earlier data acquisition process. Consequently, to continue the investigation, we must convert these 0 values to Null.
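The conversion above can be sketched as follows. This is a minimal illustration on a small synthetic frame, not the real data; the affected column names follow the Pima dataset's naming convention:

```python
import numpy as np
import pandas as pd

# Columns where a physiological value of 0 is implausible and most likely
# encodes a missing measurement (names as used in the Pima diabetes dataset).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def zeros_to_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 0 replaced by NaN in the affected columns."""
    out = df.copy()
    out[ZERO_AS_MISSING] = out[ZERO_AS_MISSING].replace(0, np.nan)
    return out

# Small synthetic sample for illustration only:
sample = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "SkinThickness": [35, 29, 0],
    "Insulin": [0, 0, 0],
    "BMI": [33.6, 26.6, 23.3],
})
cleaned = zeros_to_nan(sample)
print(cleaned.isnull().sum())
```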

Findings:

5 columns contain Null values:

Handling missing data with appropriate Mean and Median values
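A sketch of this imputation step, assuming mean imputation for roughly symmetric columns and median for skewed ones (the exact column split below is an assumption for illustration, and the frame is synthetic):

```python
import numpy as np
import pandas as pd

# Assumed split: mean for near-symmetric columns, median for skewed ones.
MEAN_COLS = ["Glucose", "BloodPressure"]
MEDIAN_COLS = ["SkinThickness", "Insulin", "BMI"]

def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs with the column mean or median, as appropriate."""
    out = df.copy()
    out[MEAN_COLS] = out[MEAN_COLS].fillna(out[MEAN_COLS].mean())
    out[MEDIAN_COLS] = out[MEDIAN_COLS].fillna(out[MEDIAN_COLS].median())
    return out

demo = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BloodPressure": [72.0, 66.0, np.nan],
    "SkinThickness": [35.0, 29.0, np.nan],
    "Insulin": [np.nan, 94.0, 168.0],
    "BMI": [33.6, np.nan, 23.3],
})
filled = impute(demo)
print(filled.isnull().sum().sum())  # no missing values remain
```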

Findings:

The project data set no longer contains missing values.

1.5 Detect Outliers

Findings:
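One common way to flag outliers, assuming the interquartile-range (IQR) rule is the detection method used here, is sketched below on a toy series:

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series) -> pd.Series:
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Toy blood-pressure-like series with one obvious outlier (150):
demo = pd.Series([60, 62, 65, 66, 68, 70, 72, 150])
print(demo[iqr_outlier_mask(demo)])
```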

1.6 Conclusion

At the end of the data preprocessing process, project_data has been modified so that:

This data set is now ready for the model development process.

2.0 Formulate machine learning tasks

2.1 Exploratory Data Analysis (EDA)

2.1.1 Explore distribution of Non-Diabetic (0) and Diabetes (1) data in each independent parameter

Findings:

2.1.2 Relationship between features

Using a scatter plot to identify relationships among all variables

Using a heat map to identify relationships among all variables
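The correlations behind such a heat map come from the pairwise Pearson correlation matrix. A minimal sketch on synthetic stand-in data (not the real Pima set):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: Outcome is driven mainly by Glucose plus noise.
glucose = rng.normal(120, 30, 200)
outcome = (glucose + rng.normal(0, 20, 200) > 130).astype(int)
df = pd.DataFrame({
    "Glucose": glucose,
    "BMI": rng.normal(32, 6, 200),
    "Outcome": outcome,
})

corr = df.corr()  # Pearson correlation matrix
print(corr["Outcome"].sort_values(ascending=False))
# The heat map itself can then be drawn with, e.g., sns.heatmap(corr, annot=True)
```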

2.1.3 Formulate hypothesis

Hypothesis 1: Which variables increase the likelihood of diabetes in women?

Glucose, BMI, Pregnancies, Diabetes Pedigree Function, and Age are the primary medical indicators that correlate positively with the development of diabetes. This can be discussed in detail as follows:

On the other hand, blood pressure, insulin, and skin thickness also affect the likelihood of developing diabetes, but not significantly.

Hypothesis 2: How does pregnancy adversely affect women's clinical parameters?

Insulin, Skin Thickness, and Diabetes Pedigree Function all correlate negatively with Pregnancies. In particular, during pregnancy, insulin levels in a woman's body frequently decline because of a hormone generated by the placenta. Additionally, as the foetus grows, the skin of pregnant women stretches thinner; the more pregnancies a woman has, the less elastic her skin becomes (Australian Government - Department of Health 2019).

Furthermore, as previously mentioned, some women do not have diabetes before they get pregnant; they develop gestational diabetes only during pregnancy, and the majority of these women return to being nondiabetic after giving birth (Diabetes Australia 2020). Therefore, the inverse correlation between Pregnancies and Diabetes Pedigree Function implies that not all examined women in the Pima diabetes data set contain the hereditary diabetes gene.

2.2 Identify learning problems for machine learning

In order to forecast the likelihood of diabetes in women based on the Pima diabetes data set, this project required the development of a SUPERVISED machine learning model employing CLASSIFICATION algorithms that meet the following criteria:

3.0 Specify three learning algorithms

Two learning algorithms were used in Assignment 2:

3.1 K-Nearest Neighbour (KNN)

3.2 Decision Tree

One more learning algorithm is added in this project:

3.3 Random Forest
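The three algorithms can be instantiated in scikit-learn as a starting point. The hyperparameter values below are common defaults, not the tuned values from section 5:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Baseline instantiations; hyperparameters are expanded and tuned later.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    print(name, "->", type(model).__name__)
```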

4.0 Data partitioning

4.1 Ratio selection

Most researchers on ResearchGate (topic and topic) and data scientists on Stack Overflow suggest that 70:30 and 80:20 are the most commonly used ratios for partitioning data in machine learning.

Because the training data set must be sufficiently large to implement k-fold cross-validation, the 80:20 ratio, often associated with the Pareto principle, has been chosen for this report. Hence, the procedure for dividing the data comprises two steps:

4.2 Data partitioning process

Splitting the dataset into dependent and independent features

Satisfying machine learning task 2.2 - selecting the eight features Glucose, BMI, Pregnancies, Diabetes Pedigree Function, Age, Blood Pressure, Insulin, and Skin Thickness to develop the model
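A sketch of the dependent/independent split, using a two-row toy frame in place of project_data (column names follow the dataset's convention):

```python
import pandas as pd

# Toy stand-in for project_data; values are illustrative only.
df = pd.DataFrame({
    "Pregnancies": [6, 1], "Glucose": [148, 85], "BloodPressure": [72, 66],
    "SkinThickness": [35, 29], "Insulin": [125, 94], "BMI": [33.6, 26.6],
    "DiabetesPedigreeFunction": [0.627, 0.351], "Age": [50, 31],
    "Outcome": [1, 0],
})
X = df.drop(columns="Outcome")  # eight independent features
y = df["Outcome"]               # dependent label
print(X.shape, y.shape)
```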

Finding:

All eight independent features have different ranges, so we need to normalize them to lie between 0 and 1.

Scaling the independent features by MinMaxScaler
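MinMaxScaler rescales each column to [0, 1] by subtracting the column minimum and dividing by the column range. A minimal sketch on a toy array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy features with very different ranges (illustration only).
X = np.array([[1.0, 200.0],
              [6.0, 90.0],
              [17.0, 140.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # each column now spans [0, 1]
print(X_scaled)
```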

Satisfying machine learning task 2.2 - maintaining the data distribution in each independent parameter

Splitting the dataset into an 80% training set and a 20% testing set with stratification

Satisfying machine learning task 2.2 - preserving the class imbalance in the Outcome column after partitioning the dataset into train and test sets

As stated in subsection 2.1, there is an imbalance between the number of diabetes and non-diabetes instances in the label column of the Pima diabetes data set. Therefore, one of the objectives for machine learning task 2.2 is to divide the dataset into train and test sets while preserving the same proportions of samples in each class in label column as seen in the original dataset.

This is achieved by utilising the train_test_split() function and setting the stratify argument to the y component of the initial dataset (scikit-learn documentation).
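The stratified split can be sketched as follows, using synthetic data with an imbalance similar to the Pima Outcome column (65% class 0, 35% class 1):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))        # stand-in for the eight scaled features
y = np.array([0] * 65 + [1] * 35)    # imbalanced label, like Outcome

# stratify=y keeps the 65:35 class ratio in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(y_train.mean(), y_test.mean())  # class proportions preserved
```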

5.0 Model development

5.1 Expanding hyperparameters

Learning Algorithm 1: K-Nearest Neighbour (KNN)

Learning Algorithm 2: Decision Tree

Learning Algorithm 3: Random Forest

5.2 K-fold Cross Validation

5.2.1 Splitting the training data set in section 4.2 into train and validation sets

5.2.2 Select number of KFold

As described in the book An Introduction to Statistical Learning, 5 and 10 are the most commonly used numbers of folds; these values have been shown empirically to yield a good balance between bias and variance.

kfold = 5 will be used in this project.
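A sketch of 5-fold cross-validation for one of the algorithms, on a synthetic stand-in for the training split (the real pipeline would pass the scaled training data from section 4.2 instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the 80% training partition (illustration only).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Stratified folds keep the class ratio inside every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
print(scores, scores.mean())
```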

5.2.3 Validation performance for each learning algorithm

Learning Algorithm 1: K-Nearest Neighbour (KNN)

Learning Algorithm 2: Decision Tree

Learning Algorithm 3: Random Forest

5.2.4 Select best model performance

Findings:

Random Forest is the most effective learning algorithm with an average accuracy score over folds of 0.83, followed by Decision Tree with an average accuracy score over folds of approximately 0.72, and K-nearest Neighbors with a score of 0.69.

This can be explained by the fact that distance-based models such as K-Nearest Neighbours are highly sensitive to outliers; hence, data sets containing outliers, such as the Pima diabetes data set, often do not yield reliable predictions with them.

Random Forest is the final model chosen to be applied to a test data set in the next section.

6.0 Performance assessment

6.1 Hyperparameter tuning
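Hyperparameter tuning is commonly done with an exhaustive grid search over candidate values, cross-validating each combination. A sketch with a hypothetical grid (the actual ranges searched in this project may differ) on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training partition (illustration only).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hypothetical grid; the values actually tuned here are an assumption.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```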

6.2 Build the model with the best hyperparameter values and evaluate its Accuracy score and Confusion Matrix
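Evaluating the final model on the held-out test set can be sketched as below, again on synthetic stand-in data rather than the real partitions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project data (illustration only).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class
print(acc)
print(cm)
```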

Add one more column named "Prediction" alongside the label column in the test set

Save the Random Forest model
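Persisting the fitted model is typically done with joblib, which ships with scikit-learn. A sketch (the file name below is an assumption, and a temp directory is used for illustration):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a stand-in model on synthetic data (illustration only).
X, y = make_classification(n_samples=100, n_features=8, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

# Hypothetical file name; saved to a temp dir for this sketch.
path = os.path.join(tempfile.gettempdir(), "random_forest_pima.joblib")
joblib.dump(model, path)

restored = joblib.load(path)
print((restored.predict(X) == model.predict(X)).all())
```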

Recommendations

Although Random Forest handles outliers well, like other tree-based models it has the potential to overfit the training data. To enhance the model, the following suggestions are offered:

References:

[1] Australian Government, Department of Health - https://www.health.gov.au/health-topics/chronic-conditions/what-were-doing-about-chronic-conditions/what-were-doing-about-diabetes

[2] Diabetes Australia - https://www.diabetesaustralia.com.au/

[3] scikit-learn documentation: KNeighborsClassifier - https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

[4] scikit-learn documentation: DecisionTreeClassifier - https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

[5] DataCamp: Machine Learning with Tree-Based Models in Python - https://www.datacamp.com/courses/machine-learning-with-tree-based-models-in-python

[6] scikit-learn documentation: train_test_split - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split

[7] Statology: K-Fold Cross Validation - https://www.statology.org/k-fold-cross-validation/